A Dynamic Indexing for Incremental Entity Resolution over Query Results
Abstract
Entity Resolution (ER) is the problem of identifying groups of tuples, from one or multiple data sources, that represent the same real-world entity. It is a crucial stage of data integration processes, which often need to integrate data at query time. The task becomes more challenging when data sources are dynamic or the volume of data is large. Because most ER techniques process all tuples at once, new solutions have been proposed to handle large volumes of data. One approach is to perform ER on query results rather than on the whole data set. In this case, the results of previous ER tasks are reused to reduce the number of pairwise tuple comparisons at query time. Similarly, indexing techniques can help identify equivalent tuples and further reduce the number of comparisons. In this context, this work proposes an indexing technique for incremental ER processes. Its expected contributions are the specification, implementation, and evaluation of the proposed indexes. We evaluated the reuse of previous ER results and highlighted its impact, and we measured the time spent storing, accessing, and updating the indexes. We conclude that reusing previous results is more efficient than recomputing the tuple comparisons.
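The idea of reusing earlier comparison results across queries can be illustrated with a minimal Python sketch. It is not the index proposed in the paper, whose structure the abstract does not describe. The names `PairIndex` and `resolve`, the string-similarity measure, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


class PairIndex:
    """Hypothetical index caching match decisions for pairs of tuple ids."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cut-off
        self.decisions = {}         # frozenset({id_a, id_b}) -> bool
        self.comparisons = 0        # number of similarity computations performed

    def is_match(self, id_a, val_a, id_b, val_b):
        key = frozenset((id_a, id_b))
        if key not in self.decisions:
            # Only pairs never seen before trigger a real comparison.
            self.comparisons += 1
            ratio = SequenceMatcher(None, val_a.lower(), val_b.lower()).ratio()
            self.decisions[key] = ratio >= self.threshold
        return self.decisions[key]


def resolve(query_result, index):
    """Group the tuples of one query result (id -> value) into entities."""
    parent = {tid: tid for tid in query_result}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, va), (b, vb) in combinations(sorted(query_result.items()), 2):
        if index.is_match(a, va, b, vb):
            parent[find(a)] = find(b)

    groups = {}
    for tid in query_result:
        groups.setdefault(find(tid), set()).add(tid)
    return sorted(sorted(g) for g in groups.values())
```

When two query results overlap, calling `resolve` on both with the same `PairIndex` recomputes only the pairs that involve tuples not seen together before. The `comparisons` counter makes that saving visible. A real index would also need to handle tuple updates and deletions, which this sketch ignores.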
Similar resources
Dynamic Similarity-Aware Inverted Indexing for Real-Time Entity Resolution
Entity resolution is the process of identifying groups of records in a single or multiple data sources that represent the same real-world entity. It is an important tool in data de-duplication, in linking records across databases, and in matching query records against a database of existing entities. Most existing entity resolution techniques complete the resolution process offline and on stati...
A Node Indexing Scheme for Web Entity Retrieval
Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an “entity retrieval system” designed ...
Benchmarking Adaptive Indexing
Ideally, realizing the best physical design for the current and all subsequent workloads would impact neither performance nor storage usage. In reality, workloads and datasets can change dramatically over time and index creation impacts the performance of concurrent user and system activity. We propose a framework that evaluates the key premise of adaptive indexing — a new indexing paradigm whe...
Efficient Dynamic Indexing and Retrieval of XML Documents using Three-Dimensional Quasi-BitCube
XML is a new standard for exchanging and representing data on the Internet. Techniques for indexing and retrieval of XML data are drawing increasing attention since they enable one to access certain parts of retrieved documents easily. However, they provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed...
Fast Incremental Indexing for Full-Text Information Retrieval
Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as information filtering, operate in dynamic environments that require frequent additions to document collecti...
Publication date: 2016